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Abstract 

Wc use the theory of normal variance-mean mixtures to derive a data-augmentation 
scheme for a class of common regularization problems. This generalizes existing theory 
on normal variance mixtures for priors in regression and classification. It also allows 
variants of the expectation-maximization algorithm to be brought to bear on a wider 
range of models than previously appreciated. We demonstrate the method on several 
examples, including sparse quantile regression and binary logistic regression. We also 
show that quasi-Newton acceleration can substantially improve the speed of the algo- 
rithm without compromising its robustness. 

Key words: Data augmentation; expectation-maximization; Sparsity; Variance-mean 
mixture of normals 

1 Introduction 

1.1 Regularized regression and classification 

Many problems in regularized estimation involve an objective function of the form 

n p 

1=1 j=i 

Here is a response, which may be continuous, binary, or multinomial; Xj is a known 
p- vector of predictors; f3 = {Pi, . . . , f3p) is an unknown vector of coefficients; / and g are 
the negative log likelihood and penalty function, respectively; and a and r are scale param- 
eters, for now assumed fixed. We may phrase the problem as one of minimizing L((3), or 
equivalently maximizing the unnormalized posterior density exp{— L(/3)}, interpreting the 
penalty as a negative log prior. 



In this paper, we unify many disparate problems of this form into a single class of normal 
variance-mean mixtures. This has two practical consequences. First, it facilitates the use of 
penalty functions in models with rich structure, such as hierarchical non-Gaussian models, 
discrete mixtures of generalized linear models, and missing-data problems. 

Second, it can be used to derive simple expectation-maximization algorithms for non- 
Gaussian models. This leads to a stable, unified approach to estimation in many problems, 
including quantile regression, logistic regression, negative-binomial models for count out- 
comes, support-vector machines, and robust regression. 

The key result is Proposition [T| which describes a simple relationship between the 
derivatives of / and g and the conditional sufficient statistics that arise in our expectation- 
maximization algorithm. The expected values of these conditional sufficient statistics can 
usually be calculated in closed form, even if the full conditional distribution of the latent 
variables is unknown or intractable. Proposition [3] provides the correponding result for the 



posterior mean estimator, generalizing the results oflMasreliez ( 1975 ) and Pericchi &: Smith 



(1992). 



One known disadvantage of expectation-maximization algorithms is that they may con- 



verge slowly, particularly if the fraction of missing information is large ( Meng &; van Dyk 



1997). Our method, in its basic form, is no exception. But we find that substantial gains 



in speed are possible using quasi-Newton acceleration (Lange, 1995). The resulting ap 



proach combines the best features of the expectation-maximization and Newton-Raphson 
algorithms: robustness and guaranteed ascent when far from the solution, and super-linear 
convergence when close to it. 



1.2 Relationship with previous work 

Our work is motivated by recent Bayesian research on sparsity-inducing priors in linear 
regression, where / = ||y — X/3|p is the negative Gaussian log likelihood, and g corresponds 



to a normal variance-mixture prior (Andrews k. Mallows, 1974). Examples of work in this 



area include the lasso (Tibshirani, 1996: Park & Casella, 2008 Hans, 2009); the bridge est! 



mator (West, 1987 Knight & Fu, 2000 Huang et al. 2008); the relevance vector machine of 



Tipping (2001); the normal/ Jeffreys prior of Figueiredo (2003) and Bae fc Mallick (2004); 



the normal/exponential-gamma model of Griffin &; Brown (2012); the normal/gamma and 



normal/inverse-Gaussian models (Caron &; Doucet, 2008: Griffin &; Brown, 2010); the horse 



shoe prior of Carvalho et al. (2010); the hypergeometric inverted-beta model of Poison & 



Scott (2012); and the double-Pareto model of Armagan et al. (2012). See Poison & Scott 



(201 la) for a review. 



We generalize this work by representing both the likelihood and prior as variance-mean 
mixtures of Gaussians. This data-augmentation approach relies upon the following decom- 
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Table 1: Variance- mean mixture representations for many common foss functions. Recall 
that Zi = Ui — xff3 for regression, or Zi = yixf/3 for binary classification. 
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where the working response Zj is equal to yi — xj(3 for Gaussian regression, or Vixf/S for 
binary classification using logistic regression or support- vector machines. Here yi is coded as 
±1. Both a and r are hyperparameters; they are typically estimated jointly with f3, although 
they may also be specified by the user or chosen by cross-validation. In some cases, most 
notably in logistic regression, the likelihood is free of hyperparameters, in which case a does 



not appear in the model. Previous studies (e.g. Poison &: Scott, 2011b Gramacy &: Poison 



2012) have presented similar results for specific models, including support- vector machines 
and the so-called powered-logit likelihood. Our paper characterizes the general case. 

One thing we do not do is to study the formal statistical properties of the resulting 
estimators, such as consistency as p and n diverge. For this we refer the reader to Fan & 



Li (2001) and Griffin & Brown (2012), who discuss this issue from classical and Bayesian 



viewpoints, respectively. 



2 Normal variance-mean mixtures 

There are two key steps in our approach. First, we use variance- mean mixtures, rather 
than just variance mixtures. Second, we interweave two different mixture representations, 
one for the likelihood and one for the prior. The introduction of latent variables {wj} and 
{Xj} in Equations ^ and Q, below, reduces exp{— L(/3)} to a Gaussian linear model with 
heteroscedastic errors: 

/•oo 

p{zi \ p,a) = / cl){zi I fiz + K^Wj"^ cj^cjj"^) dP{uji) , (2) 
Jo 

POO 

PWj \r) = ^{^j I iif, + K^jAji, r^ATi) dP{\j) , (3) 
where <^{a \ m, v) is the normal density, evaluated at a, for mean m and variance v. 
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By marginalizing over these latent variables with respect to different fixed combinations 
of {fiz^ ^ip^ Kp) and the mixing measures P{Xj) and P{uji), it is possible to generate 
many commonly used objective functions that have not been widely recognized as Gaussian 
mixtures. Table[T]lists several common likelihoods in this class, along with the corresponding 
fixed choices for {Kz,fJ-z) and the mixing distribution P{oJi). 

A useful fact is that one may avoid dealing directly with conditional distributions for 
these latent variables. To find the posterior mode, it is sufficient to use Proposition [T] to 
calculate moments of these distributions. These moments in turn depend only upon the 
derivatives of / and 5, along with the hyperparameters. 

We focus on two choices of the mixing measure: the generalized inverse-Gaussian distri- 
bution and the Polya distribution, the latter being an infinite sum of exponential random 



variables ( Barndorff-Nielsen et al. , 1982). These choices lead to the hyperbolic and Z dis- 
tributions, respectively, for the variance-mean mixture. The two key integral identities 
are 



2 2 

— n -a\e-^l\+K{e-^l) 



2a 



-e 







B{a,K) (1 + e^-M)2("-«) 



(6* I /i + KV, v) pg{v I 1,0, (q^ — K^)^/^} dv , and 
(6* I /Li + KV, v) p'p{v \ a, a — 2k) dv , (4) 







where pg and p-p are the density functions of the generalized inverse-Gaussian and Polya 
distributions, respectively. For details, see Appendix |A} We use 9 to denote a dummy 
argument that could involve either data or parameters, and v to denote a latent variance. 
All other terms are pre-specified in order to represent a particular density or function. These 
two expressions lead, by an application of the Fatou-Lebesgue theorem, to three further 
identities for the improper limiting cases of the two densities above: 

exp {— 20^"*^ max(a6', 0)} = / <p{9 \ —av, cv) dv , 

Jo 

/•oo 

c-^exp{-2c-^Pq{9)} = / (f){9\v-2TV,cv)e-^^^^-^> dv, 

Jo 

l-OO 

(1 + exp{6l - /i})"^ = / (l){9\ ^x-v/2,v) p'p{v\Q,l) dv , 
Jo 

where Pq{9) = 16*1/2 + (g — 1/2) 9 is the check-loss function. The first leads to the pseudo- 
likelihood for support-vector machines, the second to quantile and lasso regression, and the 
third to logistic and negative binomial regression under their canonical links. The function 
Pviv I 0, 1) is an improper density corresponding to a sum of exponential random variables. 
In the case of a multinomial model, the conditional posterior for a single category, given 
the parameters for other categories, is a binary logistic model. 
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3 An expectation-maximization algorithm 



3.1 Overview of approach 

By exploiting conditional Gaussianity, models of the form can be fit using an 

expectation-maximization algorithm (Dempster et al. , 1977). In the expectation step, one 
computes the expected value of the log posterior, given the current estimate (3^^^: 

C{f3 I ^^9^) = J logp(/3 I uj,X,T,a,z)p{uj,X\ P^^\t,z) du dX . 

Then in the maximization step, one maximizes the complete-data posterior as a function of 
/3. The same representation can also be used to derive Markov-chain sampling algorithms, 
although we do not explore this aspect of the approach. 

The expectation-maximization algorithm has several advantages here compared to other 
iterative methods. It is highly robust to the choice of initial value, has no user-specified 
tuning constants, and leads to a sequence of estimates f3^'^\ . . . } that monotonically 

increase the observed-data log posterior density. The disadvantage is the potential for slow 
convergence. For the models we consider, this disadvantage is real, but can be avoided using 
quasi-Newton acceleration. 

The complete-data log posterior can be represented in two ways. First, we have 

1 " 2 
logp(/3 I uj, A, r, a, z) = co{uj, A, r, cr, z) - ^ [zi - fiz - i^y^^^) 

i=l 

for some constant cq, recalling that Zi = m — xf (3 for regression or Zi = yixj (3 for classifi- 
cation. Factorizing this further as a function of (3 yields a second representation: 

^ n n 

logp(/3 I w. A, r, fj, z) = ci{uj, A, r, cr, z) - ^ ^ uji {zi - n^f + i^-y^ {zi - fJ-z) 

^ i=l 1=1 
^ fc p 

- ^ E ^^(z^, - + 5^(/3, - (6) 

for some constant ci. We now derive the expectation and maximization steps, along with 
the complete-data sufficient statistics. 



3.2 The expectation step 

From ([g]), observe that the complete-data objective function is linear in both uii and Aj. 
Therefore, in the expectation step, we replace Aj and cj, with their conditional expectations 
Xf and given the data and the current The following result provides expressions 
for these conditional moments under any model where both the likelihood and prior can be 
represented by normal variance-mean mixtures. 
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Proposition 1. Suppose that the objective function L[(3) in ([7p can be represented by 
a hierarchical variance-mean Gaussian mixture, as in Equations ^ and Q). Then the 
conditional moments Xj = E (Xj \ /3, r, z) and uji = E (wj | u, Zi) are given by 

[Pj - /^/3)Aj = Kp + r"^ g'{(3j \ r) , {zi - p,-^)u:i = + cr"^ f'{zi | /3, a) , 

where f and g' are the derivatives of f and g from 

This characterizes the required moments in terms of the likelihood f{zi) = — logp{zi \ 
(3, a) and penalty function g{l3j) = — logp(/3j | r). One caveat is that when f3j — 0, 
the conditional moment for Xj in the expectation step may be numerically infinite, and care 
must be taken. Indeed, infinite values for Xj will arise naturally under certain sparsity- 
inducing choices of g, and indicate that the algorithm has found a sparse solution. One way 
to handle the resulting problem of numerical infinities is to start the algorithm from a value 
where {f3 — fip) has no zeros, and to remove f3j from the model when it gets too close to its 
prior mean. This conveys the added benefit of hastening the matrix computations in the 
maximization step. Although we have found this approach to work well in practice, it has the 
disadvantage that a variable cannot re-enter the model once it has been deleted. Therefore, 
once a putative zero has been found, we propose small perturbations in each component of 
(3 to assess whether any variables should re-enter the model. Our method does not sidestep 
the problem of optimization over a combinatorially large space. In particular, it does not 
guarantee convergence to a global maximum if the penalty function is not convex, in which 
case restarts from different initial values will be necessary to check for the presence of local 
modes. 



3.3 The maximization step 

Returning to ([5]), the maximization step involves computing the posterior mode under a 
heteroscedastic Gaussian error model and a conditionally Gaussian prior for /3. 

Proposition 2. Suppose that the objective function L{(3) in |Ip can be represented by 
variance-mean Gaussian mixtures, as in (^-(^. Then given estimates Cji and Xj, we have 
the following expressions for the conditional maximum of /3, where oj = (wi, . . . and 
X = (Ai, . . . , Xp) are vectors, and where = diag{uji, . . . , w„) and A = diag{Xi, . . . , Xp). 

1. In a regression problem, 

/3 = {T-^k+X^ClXy\y^+h*) ,y* = X^ ({ly - fi.u; - , b* = T-^^i^X+Kf^ln) , 

where 1„, indicates a vector of ones. 

2. In a binary classification problem where = ±1 and X^, has rows x* = yiXi, 
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One may also use a series of conditional maximization steps for the regression coefficients 
(3 and for hyperparameters such as a and r. These latter steps exploit standard results on 
variance components in linear models, which can be found in, for example, Gelman (2006). 



3.4 Quasi-Newton acceleration 

There are many strategies for speeding up the expectation-maximization algorithm. A 
very simple approach that requires no extra analytical work, at least in our case, is to 
use quasi-Newton acceleration. The idea is to decompose the observed-data likelihood 
as L{f3) = C{j3) — R{(3), where C(/3) is the complete-data contribution and H(fi) is an 
unknown remainder term. This leads to a corresponding decomposition for the Hessian of 
L(/3): -V^L{(3) = -V^C{p) + V^R{(3). In models of the form (g-g, -V^C{P) is simply 
the inverse covariance matrix of the conditional normal distribution for (3, already given. 
The Hessian of the remainder, V^i?(/3), can be iteratively approximated using a series 
of numerically inexpensive rank-one updates. The approximate Hessian H = V^L(/3) — 
V^-R(/3) is then used in a Newton-like step to yield the next iterate, along with the next 
update to V^R{l3). 

We omit the updating formula itself, along with the strategy by which one ensures that 
— V^-L(/3) is always positive definite and that each iteration monotonically increases the 



posterior density. A full explanation for the general case can be found in Lange (1995). 
Below, however, we show that quasi-Newton acceleration can offer a dramatic speed-up for 
variance-mean mixtures, compared to a naive implementation of expectation-maximization. 



4 Examples 

4.1 Logistic regression 

Suppose we wish to fit a logistic regression where 

n p 

log{l + exp{-yixfp)} + ^ g{Pj \ r) , 
i=i j=i 

assuming that the outcomes yi are coded as ±1, and that r is fixed. To represent the binary 
logistic likelihood as a mixture, let lo~^ have a Polya distribution with a = 1, k = 1/2. 
Proposition [T] gives the relevant conditional moment as 

^ _ 1 / e^' 1\ 

Vi + e^> ~ 2y ■ 



(3 = arg min 



7 



Therefore, if the log prior g satisfies (|3]), then the following three updates will generate a 
sequence of estimates that converges to stationary point of the posterior: 




(9+1) 1 , .(3+1) 2 ' ^ 



where = yixf f3^a\ X^, is the matrix having rows x* = yiXi, and 17 = diag(6t;i, 
and A = diag(Ai, . . . , Xp) are diagonal matrices. 

This sequence of steps resembles iteratively re- weighted least squares due to the presence 
of the diagonal weights matrix il.. But there are subtle differences, even in the unpenalized 
case where Xj = and the solution is the maximum-likelihood estimator. In iteratively re- 
weighted least squares, the analogous weight matrix 17 has diagonal entries = — /Xj), 
where = 1/ {l + e~^i ^) is the estimated value of pr(yi = 1) at each stage of the algorithm. 
These weights arise from the expected information matrix, given the current parameter 
estimate. They decay to zero more rapidly, as a function of the linear predictor xJ (3, than 
the weights in our algorithm. This can lead to numerical difficulties when the successes and 
failures are nearly separable by a hyperplane in W ^ or when the algorithm is initialized far 
from the solution. 

To illustrate this point, we ran a series of numerical experiments with pure maximum- 
likelihood estimation as the goal. In each case we simulated a logistic-regression problem 
with standard normal coefficients Pj. Two different design matrices were used. The first 
problem had p = 100 and n = 10^, and exhibited modest collinearity: each row Xi was 
drawn from a 10-dimensional factor model, Xi = Bfi + Oj. The 100 x 10 factor- loadings 
matrix i3, the factors /j, and the idiosyncratic were all simulated from a standard normal 
distribution. The second problem was larger, but nearly orthogonal: p = 500, n = 5 x 10'^, 
with each Xij drawn independently from a standard normal distribution. 

For each data set, a logistic- regression model was estimated using iteratively re- weighted 
least squares; expectation-maximization, both with and without quasi-Newton acceleration; 
the nonlinear conjugate gradient method; and the nonlinear quasi-Newton method due to 
Broyden, Fletcher, Goldfarb, and Shanno. These last two methods require the gradient of 
the logistic-regression likelihood, which is available in closed form. Further details can be 



found in Chapters 5 and 6 of Nocedal & Wright ( 2000 ) . 



We ran the algorithms from two different starting values: fi^^ = 10 for all j, and 
a random starting location in the hypercube [—1, 1]^. In the latter case the same random 



location was used for all methods. All calculations were performed in R (R Core Team 



2012) on a standard desktop computer with 8 processor cores running at 2-66 gigahertz. 
We avoided the potential ineffiencies of R as much as possible by calling pre-compiled 
routines for multi-threaded matrix operations, non-linear gradient-based optimization, and 
iteratively re-weighted least squares. Code implementing all the experiments is available 
from the authors. 

Table [2] shows the run times for each method. These timings depend upon many fac- 
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Table 2: Run times in seconds on two simulated problems for each logistic-regression 
algorithm. 



Problem 


p = 100, n 


= 10* 


Initial value 


I3f = 10-=* 


Random 


EM 


25-9 


24-8 


Accelerated EM 


0-8 


1-3 


Quasi-Newton BFGS 


2-3 


2-3 


Nonlinear conjugate gradient 


1-8 


1-8 


IRLS 


1-9 


diverged 



p = 



= 500, n 
= 10"^ 
334 
13 
90 
198 
158 



5 X 10* 
Random 
321-7 
22-0 
88-0 
62-1 
diverged 



tors specific to a particular computing environment, including the degree to which the 
sub-routines of each algorithm are able to exploit a multi-core architecture, and the way 
optimization and matrix operations are performed in R. Thus one should not read too much 
into the precise values. 

Nonetheless, some tentative conclusions may be drawn. In both the collinear and nearly 
orthogonal cases, the iteratively re- weighted least-squares algorithm proved sensitive to the 
choice of starting value. It converged when all components of /3 were initialized to 10"'^, but 
not when initialized to a random point in [—1, 1]^. This reflects the fact that a quadratic 
approximation to the logistic likelihood can be poor in regions far from the solution. This 
may lead to an ill-conditioned linear system in the initial iterations of the algorithm, which 
is sometimes severe enough to cause divergence unless some form of trust-region strategy is 
used. 

The basic version of the expectation- maximization algorithm is slow but robust, con- 
verging in all cases. Moreoever, combining expectation-maximization with quasi-Newton 
acceleration led to an algorithm that was equally robust, and faster than all other algorithms 
we tried. 



4.2 Penalized logistic regression 

We also investigated the performance of data augmentation versus iteratively re-weighted 
penalized least squares. For this case we simulated data with a nontrivial correlation struc- 
ture in the design matrix. Let S = BB^ + where i? is a 50 x 4 matrix of standard 
normal random entries, and ^' is a diagonal matrix with Xi random entries. The rows of 
the design matrix X were simulated from a multivariate normal distribution with mean zero 
and covariance matrix S, and the coefficients Pj were standard normal random draws. The 
size of the data set was p = 50 and n = 200. We used a ridge-regression penalty, along with 



the generalized double-Pareto model where p{f3j | r) oc {1 + \ f3j\/{aT)} (^+") (Armagan 



et al. , 2012). This is non-differentiable at zero and is therefore sparsity-inducing, but has 
polynomial tails. It also has a conditionally Gaussian representation, making Proposition 
[T] applicable. 

We chose a = 2, and used each algorithm to compute a full solution path for /3 as a 
function of the regularization parameter, here expressed as log(l/T) in keeping with the 
penalized-likelihood literature. Each solution was initially computed for ri = 10~^, thereby 
constraining all coefficients to be zero or very small. The value of r was then increased 
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Solution path with doubie-Pareto penalty 



Solution path with ridge penalty 




Figure 1: The solution paths for (3 for the double-Pareto and ridge penalties, as a function 
of the regularization parameter log(l/r), for a simulated logistic regression problem. The 
black lines show the solution for iteratively re-weighted penalized least squares; the grey 
lines, for expectation-maximization. The black lines stop where iteratively re-weighted least 
squares fails due to numerical instability. 



along a discrete grid {ri, . . . ,tk = 1000}, using the solution for as the starting value for 
the Tfc+i case. 

As Figure [T] shows, iteratively re- weighted least squares failed when log(l/r) became 
too small, causing the linear system that must be solved at each stage of the algorithm to 
be numerically singular. This happened before all coefficients had entered the model, and 
when 20 out of 200 observations still had fitted success probabilities between 0-05 and 0-95. 

Sparse logistic regression via penalized likelihood is a topic of great current interest 
(Genkin et al. , 2007 Meier et al. , 2008). This problem involves three distinct issues: how to 
handle the logistic likelihood; how to choose a penalty function; and how to fit the resulting 
model. These issues interact in poorly understood ways. For example, coordinate-wise 
algorithms, including Gibbs sampling, can fare poorly in multimodal situations. Nonconvex 
penalties lead to multimodal objective functions, but also, subject to certain regularity 



conditions, exhibit more favorable statistical properties for estimating sparse signals (Fan 



Sz Li, 2001 Carvalho et al. , 2010). Moreover, coordinate descent is tractable only if the 



chosen penalty leads to a univariate thresholding function whose solution is analytically 
available (Mazumder et al. , 2011). This is a fairly narrow class, and does not include most 
of the penalties mentioned in the introduction. 

The question of how to handle the likelihood further complicates matters. For example, 
the area of detail in Figure [T] shows that, for a double-Pareto penalty, the solution paths fit 
by iteratively re-weighted penalized least squares differ in subtle but noticeable ways from 
those fit by expectation-maximization. By checking the maximized value of the objective 
function under both methods, we are able to confirm that iteratively re-weighted penalized 
least squares does not yield the true optimum. Yet we do not entirely understand why, and 
under what circumstances, the methods will differ, and how these differences should affect 
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Table 3: Results of the quantile-regression simulation study. 



Unpenalized Lasso Double Pareto 

Estimation error 17 10 10 

Out-of-sample check loss 764 704 692 

Average model size 25-0 18-2 13-1 



recommendations about what penalty function and algorithm should be used to fit logistic 
regression models. A full study of these issues is beyond the scope of the current paper, 
but is a subject of active inquiry. 

4.3 Penalized quantile regression 

Next, we show how our data-augmentation scheme can be used to fit penalized quantile- 
regression models, and we compare these fits to the corresponding unpenalized estimates. 
Choose p{iOi) to be a generalized inverse-Gaussian prior of unit scale, where {a,K,fi) = 
(1,1 — 2(7,0). This gives — logp{zi) = \zi\ + {2q — l)zi, the pseudo-likelihood which yields 



quantile regression for the qth quantile (Koenker, 2005; Li et al. , 2010). Applying Proposi- 
tion [l| we find that Ui = \yi — xffi^^'^\~^ as the expected value of the conditional sufficient 
statistic needed in the expectation step of our algorithm. 

To study the method, we simulated 50 data sets with p = 25, n = 50, and f3 = 
(5,4,3,2, 1,0, ...,0)"^. In each case the 90th percentile of the data was a linear function 
of /3 with i.i.d. N(0, 1) design points. Noise was added by simulating errors from a normal 
distribution whose 90th percentile was the linear predictor xff3, and whose variance was 
o"^ = 5^. For each data set, we fit three models: traditional quantile regression using the R 



package from Koenker (2011), along with quantile regression under the lasso penalty and 
the generalized double-Pareto penalty with a = 3. For the second and third method, the 
scale of regularization r was chosen by cross validation. Performance was measured by 
squared-error loss in estimating /3, and out-of-sample check loss on a new data set of the 
same size, simulated from the same model. 

The results are in Table [3j Both regularized versions of quantile regression for the 
90th percentile seem to outperform the straight estimator. No significant differences in 
performance emerged between the lasso and double-Pareto penalties, although the double- 
Pareto solutions were systematically more sparse. 

5 Discussion 

Our primary goal in this paper has been to show the relevance of the conditionally Gaussian 
representation of Q-Q, together with Proposition [l| for fitting a wide class of regularized 
estimators within a unified variance-mean mixture framework. We have therefore focused 
only on the most basic implementation of the expectation-maximization algorithm, together 
with quasi-Newton acceleration. 
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There are many variants of the expectation-maximization algorithm, however, some of 



which can lead to dramatic improvements (Meng & van Dyk, 1997 Gelman et al. , 2008b). 



These variants include parameter expansion (Liu & Wu, 1999), majorization-minimization 



(Hunter & Lange, 2000), the partially collapsed Gibbs sampler (van Dyk Sz Park, 2008), 



and simulation-based alternatives (van Dyk & Meng 2010). Many of these modifications 
require additional analytical work for particular choices of g and /. One example here 



includes the work of Liu (2005) on the robit model. We have not explored these options 



here, and this remains a promising area for future research. 

A second important fact is that, for many purposes, such as estimating /3 under squared- 
error loss, the relevant quantity of interest is the posterior mean rather than the mode. 



Indeed, both Hans (2009) and Efron (2009) argue that, for predicting future observables, the 



posterior mean of (3 is the better choice. The following proposition represents the posterior 
mean for /? in terms of the score function of the predictive distribution, generalizing the 



results of Brown (1971), Masreliez (1975), Pericchi & Smith (1992), and Carvalho et al. 



(2010). There are a number of possible versions of such a result. Here we consider a 
variance-mean mixture prior p{/3j) with a location likelihood p{y — /?), but a similar result 
holds the other way around. 

Proposition 3. Let p{\y — /3j\) be the likelihood for a location parameter (5j, symmetric in 
y — P, and let p{f3j) = J (j){/3j; + K^/Aj, r^/Aj) p{\~^) dXj^ be a normal 
mixture prior. Define the following quantities: 

m{y) = [ p{y - f3j)pi(3,) d(3j , p^Xf] 



vanance-mean 



'f,^i + K/\j,T^/\,)p\\J^) 



E{\,) 
m*{y) = 



Then 



E{f3j I y) 



Kp ^ \ fipE{X,) \ \m*{y)\ ^ \E{X^) \ \m*{y) \ \ d log m*{y) 



V 



m{y) 



V 



m(y) 



dy 



The generalization to nonorthogonal designs is straightforward, following the original 



Masreliez (1975) paper. See, for example, Griffin & Brown (2010), along with the discussion 



of the Tweedie formula by Efron (2011) 



Computing the posterior mean will typically require sampling from the full joint poste- 
rior distribution over all parameters. Our data-augmentation approach can lead to Markov- 



chain Monte Carlo sampling schemes for just this purpose (e.g. Gelman et al. , 2008a). The 
key step is the identification of the conditional posterior distributions for Xj and Ui. We 
have made some initial progress for logistic and negative-binomial models, described in a 
technical report available from the second author's website. This remains an active area of 
research. 
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A Appendix A: distributional results 
A.l Generalized hyperbolic distributions 

In all of the following cases, we assume that {9 \ v) ~ A^(/i + k,v,v), and that v ~ piv). 
Let p{v I ijj, 7, 6) be a generalized inverse-Gaussian distribution, following the notation of 



Barndorff-Nielsen Sz Shephard (2001). Consider the special case where if) = 1 and 5 = 0, in 



which case p{9) is a hyperbolic distribution having density 



p{e I n,a,K) 



a- 



2a 



exp {— a|0 — ^Ji\ + k{9 — n)} 



When viewed as a pseudo-likelihood or pseudo-prior, the class of generalized hyperbolic 
distributions will generate many common objective functions. Choosing (a, k, fi) = (1, 0, 0) 
leads to — logp(/3j) = \ f3j\, and thus regularization. Choosing (a, k, fi) = (1, l—2q, 0) gives 
— logp{zi) = \zi\ + {2q — l)zi. This is the check-loss function, yielding quantile regression 
for the qth quantile. Finally, choosing (a, k, fi) = (1, 1, 1) leads to the maximum operator: 

-{1/2) log p{zi) = (1/2)|1-Zi| + (1/2)(1-Zi) = max(l-zi,0), 



where Zi = yixj (5. This is the objective function for support vector machines (e.g. Mallick 



et al. , 2005 Poison & Scott, 2011b), and corresponds to the limiting case of a generalized 



inverse-Gaussian prior. 



A. 2 Z distributions 

Let P'p{v I Q,a — 2k) be a Polya distribution, which can be represented as an infinite 



convolution of exponentials, and leads to a Z distributed marginal (Barndorff-Nielsen et al 



1982). The important result is the following: 



Pz{ 



//, a, K) 



5(a,K) (l + e^-f')2{"-«^) Jo 



M {n + Kv, v) pp{v \ a, a — 2k) dv . 



For logistic regression, choose {a,K,^) = (1,1/2,0), leading to p[zi) = e^*/(l + e^^) 
with Zi = UixJ (3. This corresponds to a limiting improper case of the Polya distribution. 
The necessary mixture representation still holds, however, by applying the Fatou-Lebesgue 



theorem (Gramacy & Poison, 2012). 



For the multinomial generalization of the logistic model, we require a slight modifica- 
tion. Suppose that yi G {1, • • . ,K} is a category indicator, and that (5^ = {Pkii ■ ■ ■ , Pkp)'^ 
is the block of p coefficients for the kth. category. Let 77,^ = exp {x f/3k ~ Qfc) /{I + 



exp - Cifc)}, where Cik{P-k) = log | exp(xf We follow 



Holmes & Held 
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(2006), writing the conditional likelihood for Pk as 



n K ri 
i=l 1=1 1=1 

„Ta „ \Kyi=k) 

-f—f I CJi-p {J^ 

oc 

i=l 



jj I exp {xfPk - CikY 

I + exp {xj f3k - Cik) 



where Wi is independent of and I is the indicator function. Thus the conditional likelihood 
for the kth block of coefficients given all the other blocks of coefficients /3_fc, can be 
written as a product of n terms, the ith term having a Polya mixture representation with 
K-ik = = k) — 1/2 and fiik = Cik{f3-k)- This allows regularized multinomial logistic 
models to be fit using the same approach of Section 4.1, with each block /3fc updated in a 
conditional maximization step. 

B Appendix B: Proofs 

B.l Proposition [l] 

Since (/> is a normal kernel, 

d(j){l3j \ fl/s + Ki3/Xj,T'^/Xj) _ f l^j - - Kfs/Xj\ 2 



We use this fact to differentiate 



oo 



P{Pi \^)= f 4>{P3 I /^z? + t^plXj,T /Xj) p{Xj I r) dXj 







under the integral sign to obtain 



Ql^-Pi^j ' " io Wj ^'^^^^ ' ^ ''/'/Aj^r /Aj)} p(Aj I r) dA^ . 
Dividing by p{Pj \ t) and using ([7| for the inner function, we get 



Thus 

Equivalently, in terms of the penalty function — logp(/3j | r), 



9 

- ///3)S (Aj I 13 j) = Kp-T^— logpiPj I r) 
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By a similar argument, 



.2 d 



{zi - Hz)E {uji I (3, Zi, a) = Kz- a — \ogp{zi | /3, ct) . 

We obtain the result using the identities 

did, 

— \ogp[zi I j3i) = -f'{zi I /3, a) and — logp(/3j | r) = -g'{l3j \ r) . 

B.2 Proposition [2] 

We derive the expressions for a regression problem, with those for classification involving 
only small changes. Begin with Equation ([5]). Collecting terms, we can represent the log 
posterior, up to an additive constant not involving (3, as a sum of quadratic forms in /3: 

logp{(3 \ u},X,T,a,z) = - ^ (^{y - - HzU}~^} - X 13^ Q (^{y - fi^l - k^uj-^} - X 

recalling that uj^^ and A^^ are column vectors. This is the log posterior under a normal prior 
/? ~ N(/i^l + K^A^"*^, r^A~^) after having observed the working response y — fJ-z^ — KzU!~^. 
The identity + kuj^^) = fiuj + k1 then gives the result. 

For classification, on the other hand, let be the matrix with rows x* = yiXi. The 
kernel of the conditionally normal likelihood then becomes — /x^l — k^cj"^)-^ i7 (X*/3 — 
/x^l — KzU:~^)- Hence it is as if we observe the n-dimensional working response Hz^ + k^oj^^ 
in a regression model having design matrix X^. 

B.3 Proposition [3] 

Our extension of Masreliez's theorem to variance-mean mixtures follows a similar path as 
Proposition [TJ Since (f) \s a. normal kernel, we may apply ([7]), giving 

1 fl I , /X 2/x N ^^f}-l^|3 d(l){f3j \ + Kp/Xj,T^/Xj) 

^m^, I + K,/A„r /A,) = — . 

The rest of the argument follows the standard Masreliez approach. 
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